The abundances reported by an affinity purification experiment can be used to infer whether two proteins interact. The data take the form of an abundance value, measured across a number of experiments, for every protein involved in the experiment. The more abundant a protein is, the more likely it is to be seen interacting with other proteins.
Missing values will be treated as they were for the affinity feature extraction. This means we must pickle an average value for this feature, computed over the full training set, to substitute wherever the value is missing.
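As a toy sketch of that strategy (the numbers are made up, and 0 marks a missing value, matching the missing label used below):

# hypothetical observed feature values; 0 marks a missing entry
observed = [3.2, 0.0, 5.1, 4.4, 0.0]
# average over the non-missing values only
nonmissing = [v for v in observed if v > 0]
average = sum(nonmissing) / len(nonmissing)
# substitute the average wherever the value is missing
filled = [v if v > 0 else average for v in observed]
print filled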
In [1]:
    
cd ../..
    
    
In [3]:
    
import csv
    
In [4]:
    
f = open("datasource.abundance.tab","w")
c = csv.writer(f,delimiter="\t")
# just the abundance feature: the source CSV, the database file to build, and parser options
c.writerow(["forGAVIN/pulldown_data/dataset/ppi_ab_entrez.csv",
            "forGAVIN/pulldown_data/dataset/abundance.Entrez.db","ignoreheader=1;zeromissinginternal=1"])
f.close()
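As a quick sanity check (not part of the original run), the table can be printed back; each tab-separated line names the raw data file, the database file to build from it, and the parser options:

# read back the data source table written above
f = open("datasource.abundance.tab")
print f.read()
f.close()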
    
In [6]:
    
import sys
    
In [7]:
    
sys.path.append("opencast-bio/")
    
In [8]:
    
import ocbio.extract
    
In [10]:
    
# unlock the annexed file so it can be rewritten by the extractor
!git annex unlock forGAVIN/pulldown_data/dataset/abundance.Entrez.db
    
    
In [11]:
    
# build the assembler from the data source table defined above
assembler = ocbio.extract.FeatureVectorAssembler("datasource.abundance.tab",verbose=True)
    
    
In [14]:
    
# assemble the abundance feature for every pulldown interaction pair;
# pairs with no value are written with the missing label "0"
assembler.assemble("forGAVIN/pulldown_data/pulldown.interactions.Entrez.tsv",
                   "features/pulldown.interactions.interpolate.abundance.targets.txt",
                   verbose=True, missinglabel="0")
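The assembler writes one abundance feature value per interaction pair, with missing pairs written as 0. A quick peek at the first few lines of the output (not in the original run):

# inspect the first few feature values written by the assembler
f = open("features/pulldown.interactions.interpolate.abundance.targets.txt")
for i in range(5):
    print f.readline().strip()
f.close()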
    
    
In [15]:
    
# load the abundance feature values just written
y = loadtxt("features/pulldown.interactions.interpolate.abundance.targets.txt")
    
In [30]:
    
print "Average value of abundance feature: {0}".format(mean(y[y>1]))
    
    
In [17]:
    
import pickle
    
In [43]:
    
f = open("forGAVIN/pulldown_data/dataset/abundance.average.pickle","wb")
pickle.dump([mean(y)],f)
f.close()
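When the full training set is assembled later, this stored average can be loaded back and substituted for missing abundance values; a minimal sketch of that lookup (the actual substitution happens elsewhere in the pipeline):

# sketch: recover the stored average for use over missing values
import pickle
f = open("forGAVIN/pulldown_data/dataset/abundance.average.pickle","rb")
average = pickle.load(f)[0]
f.close()
print average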
    
To make sure that linear regression isn't going to work any better on this dataset than it did in the affinity notebook, we fit a linear regression model here as well:
In [34]:
    
# load the X feature vectors for the same interaction pairs
X = loadtxt("features/pulldown.interactions.interpolate.vectors.txt")
    
In [35]:
    
import sklearn.utils
    
In [36]:
    
# shuffle X and y together so rows stay aligned
X,y = sklearn.utils.shuffle(X,y)
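sklearn.utils.shuffle applies the same permutation to every array it is given, so each feature vector stays paired with its abundance value; a tiny standalone demonstration with made-up arrays:

import numpy
import sklearn.utils
a = numpy.array([1, 2, 3])
b = numpy.array([10, 20, 30])
# both arrays are permuted identically
a2, b2 = sklearn.utils.shuffle(a, b, random_state=0)
print a2, b2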
    
In [37]:
    
import sklearn.cross_validation
    
In [38]:
    
# ten-fold cross-validation over all samples
kf = sklearn.cross_validation.KFold(y.shape[0],10)
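As a tiny standalone illustration (made-up size, not part of the run above), each iteration of a KFold yields the train and test indices for one fold:

import sklearn.cross_validation
for train, test in sklearn.cross_validation.KFold(6, n_folds=3):
    print train, test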
    
In [40]:
    
import sklearn.linear_model
    
In [41]:
    
scores = []
for train,test in kf:
    # split the data into this fold's train and test sets
    X_train, X_test, y_train, y_test = X[train], X[test], y[train], y[test]
    # fit the linear regression model
    linreg = sklearn.linear_model.LinearRegression()
    linreg.fit(X_train,y_train)
    # score the model (R^2) on the held-out fold
    scores.append(linreg.score(X_test,y_test))
    
    
I got tired of waiting for this to finish.
In [42]:
    
print scores
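For a single summary number, the mean R^2 across the folds could also be printed (not part of the original run):

# LinearRegression.score returns R^2, so this is the mean R^2 over folds
print sum(scores)/len(scores)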
    
    
Looks like there's very little advantage to a linear regression model here.